This project performs a simple data analysis on COVID-19 case data.
Data Source: https://data.world/covid-19-data-resource-hub/covid-19-case-counts/workspace/file?filename=COVID-19+Cases.csv
Data Dictionary: https://data.world/covid-19-data-resource-hub/covid-19-case-counts/workspace/data-dictionary
Step 1: Load the libraries and data
import os
import numpy as np
import pandas as pd
data = pd.read_csv('COVID-19-Cases.csv')
data.head()
The pandas read_csv function loads the CSV file into a pandas dataframe.
The head function displays the first 5 records of the dataframe.
The first 5 records already show that the dataset contains many missing values.
data.shape
The shape attribute returns the number of rows and columns in a dataframe.
There are 92070 records and 13 fields in the dataset.
Step 2: Data Summary
data.describe()
The describe function displays summary statistics of a dataframe, such as the count, mean, standard deviation, quartiles (including the median), and minimum and maximum values. By default it selects only the fields with a numeric datatype.
Notice that the minimum number of cases is zero. The 75th percentile is 1, meaning 75% of records have at most 1 case, while the maximum number of cases is 101739.
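To make the percentile reading concrete, here is a minimal sketch on a small made-up Series. The values are purely illustrative and are not taken from the dataset. It shows that a 75th percentile of 1 means at least 75% of the values are less than or equal to 1, not strictly less than 1.

```python
import pandas as pd

# Hypothetical case counts, chosen only for illustration
cases = pd.Series([0, 0, 0, 0, 1, 1, 1, 50])

# 75th percentile (pandas uses linear interpolation by default)
q75 = cases.quantile(0.75)

# Fraction of values that are less than or equal to the 75th percentile
share_at_most_q75 = (cases <= q75).mean()

print(q75, share_at_most_q75)
```

A single large value (50 here) barely moves the quartiles, which is why the maximum can be far above the 75th percentile.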
data.select_dtypes('object').describe()
There are 2 types of cases, 177 country regions, 132 province states, and 1845 Admin2 values (Admin2 refers to the county).
Province state data is available only for the following country regions: Australia, Canada, China, Denmark, France, Netherlands, United Kingdom, United States.
Admin2 and Combined_Key data is available only for the US region.
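One way such availability claims could be checked is to filter on non-missing values and list the countries that remain. The sketch below uses a tiny hypothetical dataframe (the rows are invented; only the column names follow the ones used in this notebook), so it illustrates the approach rather than reproducing the actual result.

```python
import pandas as pd

# Hypothetical miniature of the dataset; rows are invented for illustration
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'China', 'India', 'Canada'],
    'Province_State': ['New York', 'Texas', 'Hubei', None, 'Ontario'],
    'Admin2': ['Albany', None, None, None, None],
})

# Countries with at least one non-missing Province_State value
has_province = sample['Province_State'].notnull()
with_province = sorted(sample.loc[has_province, 'Country_Region'].unique())

# Countries with at least one non-missing Admin2 (county) value
has_admin2 = sample['Admin2'].notnull()
with_admin2 = sorted(sample.loc[has_admin2, 'Country_Region'].unique())

print(with_province)
print(with_admin2)
```

Applied to the full dataframe, the same two filters on data would list the country regions that carry province and county level detail.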
print("Different case types are ",list(data['Case_Type'].unique()))
data['Date'] = pd.to_datetime(data['Date'])
print("Data is collected from ",min(data['Date'])," to ",max(data['Date']))
print("Different Table Names from which data is collected ",list(data['Table_Names'].unique()))
Step 3: Handling Missing Values
data.isnull().sum()
Province_State, Admin2, Combined_Key and FIPS have missing values because these fields are not collected for all country regions.
The number of cases is missing in 324 records, and latitude and longitude information is missing in 354 records. Let's explore the data further to see why these values could be missing.
missing_cases_data = data[data['Cases'].isnull()]
missing_cases_data.head()
print(missing_cases_data['Date'].unique())
print(missing_cases_data['Country_Region'].unique())
print(missing_cases_data['Table_Names'].unique())
The missing cases all come from the US Daily Summary data between 23rd March and 30th March.
missing_lat_data = data[data['Lat'].isnull()]
missing_lat_data.head()
print(missing_lat_data['Country_Region'].unique())
missing_lat_data['Country_Region'].value_counts()
data[(data['Country_Region'] == 'US')|(data['Country_Region'] == 'Cruise Ship')]['Country_Region'].value_counts()
Since Cruise Ship is not a geographic region, it has no latitude and longitude values. There are also 138 regions in the US with no latitude and longitude information. Let's explore further to understand why this data is missing.
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Table_Names'].unique())
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Date'].unique())
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Cases'].unique())
The missing US location information falls in the same period, 23rd March to 30th March, in which the case counts are missing.
For the analysis, we will simply ignore the missing data.
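Ignoring the missing data can be done in more than one way. The sketch below, on a small hypothetical dataframe with invented rows, shows two options: dropping rows with a missing case count explicitly, or relying on the fact that pandas aggregations such as sum skip NaN values by default. Which option fits best depends on the later analysis, so treat this as an assumption rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one mimics a record with missing cases and location
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'Italy'],
    'Cases': [10.0, np.nan, 5.0],
    'Lat': [40.7, np.nan, 41.9],
})

# Option 1: explicitly drop rows where the case count is missing
clean = sample.dropna(subset=['Cases'])

# Option 2: keep all rows; sum() skips NaN by default (skipna=True)
total_all_rows = sample['Cases'].sum()
total_clean_rows = clean['Cases'].sum()

print(len(clean), total_all_rows, total_clean_rows)
```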
Step 4: Plots
data.head()
import plotly.express as px
fig = px.pie(data, names='Case_Type', title='Number of records of each case type')
fig.show()
Half of the records are deaths and half are confirmed cases. This is because both deaths and confirmed cases are recorded for each country and region on each day.
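The same split can be read numerically instead of from the chart. This is a minimal sketch on a hypothetical dataframe; the case type labels 'Confirmed' and 'Deaths' are assumed here and may differ from the exact values in the dataset.

```python
import pandas as pd

# Hypothetical rows: each (country, date) pair appears once per case type
sample = pd.DataFrame({
    'Country_Region': ['Italy', 'Italy', 'Spain', 'Spain'],
    'Date': pd.to_datetime(['2020-03-01'] * 4),
    'Case_Type': ['Confirmed', 'Deaths', 'Confirmed', 'Deaths'],  # assumed labels
})

# Share of records per case type (fractions that sum to 1)
shares = sample['Case_Type'].value_counts(normalize=True)

print(shares)
```

Running value_counts(normalize=True) on data['Case_Type'] would give the corresponding shares for the full dataset.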
fig = px.pie(data,names='Country_Region', title='Number of records for each region')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
62.7% of the records are US data, mainly because US data is reported for each state and county as well. A better way to compare the number of data points per country is to filter the data by country and unique dates.
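The idea of counting by country and unique dates can be sketched as follows. The dataframe is hypothetical (invented rows, with column names following this notebook), and counting distinct dates per country is only one possible reading of the filter described above.

```python
import pandas as pd

# Hypothetical rows: the US has county-level rows, Italy has one row per date
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'US', 'US', 'Italy', 'Italy'],
    'Admin2': ['Albany', 'Bronx', 'Albany', 'Bronx', None, None],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02',
                            '2020-03-02', '2020-03-01', '2020-03-02']),
})

# Raw record counts are inflated for the US by its county-level rows
records_per_country = sample['Country_Region'].value_counts()

# Number of distinct dates per country, independent of how many sub-regions report
dates_per_country = sample.groupby('Country_Region')['Date'].nunique()

print(records_per_country)
print(dates_per_country)
```

After calling reset_index on the result, it could be passed to px.pie with the names and values arguments to draw a chart that is not dominated by the US county rows.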